Survey of Experience Replay

Hiroyuki Yamada

List of Papers Surveyed

1. Competitive Experience Replay (CER)
2. Remember and Forget for Experience Replay (ReF-ER)
3. Experience Replay Optimization (ERO)
4. Attentive Experience Replay (AER)
5. Dynamic Experience Replay (DER)
6. Neural Experience Replay Sampler (NERS)

1. Competitive Experience Replay (CER)

H. Liu et al., “Competitive Experience Replay”, ICLR (2019) (arXiv)

  • For goal-oriented sparse-reward tasks, CER improves on HER (the two are used together).
  • Train a pair of agents $\pi _A$ and $\pi _B$ together.
  • Reward re-labelling within each mini-batch (sketched in code at the end of this section):
    • $r^i _A \leftarrow r^i _A - 1$ if $\exists j:\ |s^i _A - s^j _B| < \delta$
    • $r^j _B \leftarrow r^j _B + N$ where $N$ is the number of states $s^i _A$ satisfying $|s^i _A - s^j _B| < \delta$
  • Two types of initialization for $\pi _B$:
    • independent-CER: $s^0 _B \sim p(s^0)$ (the task’s usual initial-state distribution)
    • interact-CER: $s^0 _B \sim p(s^i _A)$ (a random off-policy state sampled from $\pi _A$’s experience)

  • Results: HER+CER outperforms the other baselines.
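
Below is a minimal sketch of CER’s mini-batch reward re-labelling, assuming the sampled states and rewards of agents A and B are NumPy arrays; the function name, the Euclidean distance, and the value of `delta` are illustrative assumptions rather than the authors’ implementation.

```python
import numpy as np

def cer_relabel(states_A, rewards_A, states_B, rewards_B, delta=0.5):
    """Competitive reward re-labelling for one mini-batch (sketch).

    Agent A is penalized for every state that agent B also reached
    (within distance delta); agent B gains +1 per such matched state.
    """
    # Pairwise distances between A's and B's sampled states: shape (nA, nB).
    dists = np.linalg.norm(states_A[:, None, :] - states_B[None, :, :], axis=-1)
    close = dists < delta                 # close[i, j] <=> |s_A^i - s_B^j| < delta

    r_A = rewards_A.copy()
    r_B = rewards_B.copy()
    r_A[close.any(axis=1)] -= 1.0         # r_A^i <- r_A^i - 1 if some s_B^j is close
    r_B += close.sum(axis=0)              # r_B^j <- r_B^j + (# of close s_A^i)
    return r_A, r_B

# Toy usage with random data.
rng = np.random.default_rng(0)
sA, sB = rng.normal(size=(32, 3)), rng.normal(size=(32, 3))
new_rA, new_rB = cer_relabel(sA, np.zeros(32), sB, np.zeros(32))
```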

2. Remember and Forget for Experience Replay (ReF-ER)

G. Novati and P. Koumoutsakos, “Remember and Forget for Experience Replay”, ICML (2019) (arXiv, code)

  • Define the importance weight $\rho _t = \frac{\pi (a_t \mid s_t)}{\mu _t (a_t \mid s_t)}$, where $\pi$ is the current policy and $\mu _t$ is the behavior policy.
  • If $\frac{1}{c_{\text{max}}} < \rho _t < c _{\text{max}}$, the transition is “near-policy”; otherwise it is “far-policy”.
  • For “far-policy” transitions, the gradient is clipped to 0: $\hat{g}(w) \to 0$
  • Penalize “off-policyness” with a KL divergence: $\hat{g}^D(w) = \mathbb{E} [\nabla D_{\text{KL}}(\mu _k(\cdot \mid s_k) \,\|\, \pi ^w (\cdot \mid s_k))]$
  • Total gradient (see the sketch at the end of this section): $\hat{g}^{\text{ReF-ER}} = \beta \hat{g}(w) - (1-\beta) \hat{g}^D(w)$
    • Annealing of the trade-off parameter: $\beta \leftarrow \begin{cases} (1-\eta) \beta & \text{if } n_{\text{far}}/N > D \\ (1-\eta)\beta + \eta & \text{otherwise} \end{cases}$
    • $\eta$: NN learning rate
    • $n_{\text{far}}$: the number of “far-policy” transitions
    • $N$: the number of transitions in the Replay Buffer
    • $D$: hyperparameter (the tolerated fraction of far-policy samples)

  • Results for DDPG with ReF-ER
  • The authors say “replacing ER with ReF-ER stabilizes DDPG and greatly improves its performance, especially for tasks with complex dynamics (e.g. Humanoid and Ant)”
  • The authors also evaluate NAF with ReF-ER and V-RACER with ReF-ER
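
A minimal sketch of ReF-ER’s two rules, assuming the per-sample importance weights $\rho _t$ are already available as an array; the variable names and the default values of `c_max`, `D`, and `eta` are illustrative assumptions.

```python
import numpy as np

def refer_mask_and_beta(rho, beta, c_max=4.0, D=0.1, eta=1e-4):
    """Rule 1: build a 0/1 mask that zeroes the RL gradient of far-policy samples.
    Rule 2: anneal beta so the fraction of far-policy samples stays near D.
    Default hyperparameter values are illustrative."""
    near = (rho > 1.0 / c_max) & (rho < c_max)   # near-policy samples
    far_frac = 1.0 - near.mean()                 # fraction of far-policy samples (here: over rho)

    if far_frac > D:
        beta = (1.0 - eta) * beta                # too off-policy: strengthen the KL penalty
    else:
        beta = (1.0 - eta) * beta + eta          # push beta back toward 1
    return near.astype(np.float32), beta

# Total gradient, schematically: g = beta * (mask * g_rl) - (1 - beta) * g_kl
rho = np.random.default_rng(0).lognormal(sigma=0.5, size=1024)
mask, beta = refer_mask_and_beta(rho, beta=1.0)
```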

3. Experience Replay Optimization (ERO)

D. Zha et al., “Experience Replay Optimization”, IJCAI (2019) 4243-4249 (arXiv)

  1. The replay policy infers masking probabilities from features extracted from each transition: $\boldsymbol{\lambda} = \lbrace \phi (\mathbf{f}_{\mathcal{B} _i}\mid \theta ^{\phi}) \rbrace _{i=1}^{N} \in \mathbb{R}^{N}$, where $N$ is the number of transitions in the Replay Buffer
  2. The Replay Buffer is masked with a Bernoulli distribution: $\mathbf{I} \sim \text{Bernoulli}(\boldsymbol{\lambda})$
  3. The agent trains on transitions sampled uniformly from the masked Replay Buffer
  4. The replay policy trains with (see the sketch at the end of this section):
    1. the replay reward (difference of cumulative episode rewards between the current policy $\pi$ and the previous policy $\pi ^{\prime}$): $r^r = r^{c}_{\pi} - r^{c} _{\pi ^{\prime}}$
    2. the policy-gradient estimate $\nabla _{\theta ^{\phi}} \mathcal{J} \approx \sum _ {j:\mathcal{B}_j \in B^{\text{batch}}} r^{r} \nabla _{\theta ^{\phi}} \left[ \mathbf{I}_j \log \phi (\mathbf{f}_{\mathcal{B} _j}\mid \theta ^{\phi}) + (1-\mathbf{I}_j)\log \left(1-\phi (\mathbf{f}_{\mathcal{B} _j}\mid \theta ^{\phi})\right) \right]$

  • Average return over 5 runs
  • The authors say “ERO consistently outperforms all the baselines on most of the continuous control tasks in terms of sample efficiency”
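
A hedged sketch of ERO’s masking and update steps with a simple logistic “replay policy”; the choice of features, the model, and the learning rate are illustrative assumptions, while the Bernoulli masking and the replay-reward-weighted gradient follow the list above.

```python
import numpy as np

rng = np.random.default_rng(0)

class ReplayPolicy:
    """Scores each transition's feature vector with a logistic model phi(f | theta)
    and learns from the replay reward via a REINFORCE-style update (sketch)."""
    def __init__(self, n_features, lr=1e-2):
        self.w = np.zeros(n_features)
        self.lr = lr

    def probs(self, feats):                      # lambda_i = phi(f_i | theta)
        return 1.0 / (1.0 + np.exp(-feats @ self.w))

    def sample_mask(self, feats):                # I ~ Bernoulli(lambda)
        lam = self.probs(feats)
        return (rng.random(lam.shape) < lam).astype(np.float64)

    def update(self, feats, mask, replay_reward):
        # Gradient of sum_j [I_j log phi_j + (1 - I_j) log(1 - phi_j)] w.r.t. w,
        # weighted by the scalar replay reward r^r.
        lam = self.probs(feats)
        grad = ((mask - lam)[:, None] * feats).sum(axis=0)
        self.w += self.lr * replay_reward * grad

# Usage: features per transition are assumed, e.g. (reward, |TD error|, age).
feats = rng.normal(size=(256, 3))
policy = ReplayPolicy(n_features=3)
mask = policy.sample_mask(feats)                 # train the agent on transitions with mask == 1
replay_reward = 1.3 - 0.9                        # r^r = r^c_pi - r^c_pi'
policy.update(feats, mask, replay_reward)
```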

4. Attentive Experience Replay (AER)

P. Sun et al., “Attentive Experience Replay”, AAAI (2020) 34, 5900-5907

  • Sample $\lambda \times k$ transitions uniformly from the Replay Buffer
  • Compute a (task-dependent) similarity $\mathcal{F}(s_j, s_t)$ between each sampled state $s_j$ and the current state $s_t$
    • For MuJoCo: $\mathcal{F}(s_1,s_2) = \frac{s_1\cdot s_2}{| s_1 | | s_2 |}$ (cosine similarity)
    • For Atari 2600: $\mathcal{F}(s_1,s_2) = - | \phi (s_1) - \phi (s_2) |_2$
  • Select the $k$ most similar samples as the mini-batch (see the sketch below)
  • Anneal $\lambda$ linearly from $\lambda _0$ to $1$ within $\alpha \cdot T$ steps
    • $T$: total training steps
    • $\alpha < 1$: hyperparameter
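
A minimal sketch of AER’s attentive sampling for vector observations (the MuJoCo-style cosine similarity), assuming the buffer states sit in a NumPy array; the helper names and the annealing constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def aer_sample(buffer_states, current_state, k, lam):
    """Draw lam*k transitions uniformly, keep the k whose states are most
    similar (cosine similarity) to the current state. Returns buffer indices."""
    n = len(buffer_states)
    pre_idx = rng.choice(n, size=min(int(lam * k), n), replace=False)
    cand = buffer_states[pre_idx]
    # Cosine similarity between each candidate state and the current state.
    sims = cand @ current_state / (
        np.linalg.norm(cand, axis=1) * np.linalg.norm(current_state) + 1e-8)
    top = np.argsort(sims)[-k:]                  # indices of the k most similar
    return pre_idx[top]

def anneal_lambda(step, total_steps, lam0=4.0, alpha=0.5):
    """Linearly anneal lambda from lam0 to 1 over alpha * T steps (lam0, alpha illustrative)."""
    frac = min(step / (alpha * total_steps), 1.0)
    return lam0 + (1.0 - lam0) * frac

states = rng.normal(size=(10_000, 8))            # toy replay buffer of states
batch_idx = aer_sample(states, rng.normal(size=8), k=64,
                       lam=anneal_lambda(step=0, total_steps=1_000_000))
```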

5. Dynamic Experience Replay (DER)

J. Luo et al., “Dynamic Experience Replay”, CoRL (2020) (arXiv)

  • DER augments human demonstrations by utilizing successful transitions.
  • Multiple Replay Buffers $\lbrace \mathbb{B} _0, \dots, \mathbb{B} _n \rbrace$, each with a demonstration zone
  • At the end of an episode, the worker stores its transitions into one randomly picked replay buffer $\mathbb{B}_i$
  • If the episode succeeds, the transitions are also stored into a separate buffer $\mathbb{T}$
  • The trainer randomly picks one of the buffers $\mathbb{B}_j$, then samples from it, trains, and updates priorities
  • Periodically, the demonstration zone of each $\mathbb{B}_k$ is replaced by random samples of successful transitions from $\mathbb{T}$ (see the sketch below)
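
A hedged sketch of DER’s buffer bookkeeping (per-buffer demonstration zones, a success buffer, periodic refresh); the class layout and the zone size are illustrative assumptions, and prioritized sampling is omitted for brevity.

```python
import random

class DERBuffers:
    """Several replay buffers, each reserving a 'demonstration zone' that is
    periodically refilled with successful transitions (sketch, no priorities)."""
    def __init__(self, n_buffers=4, zone_size=128, demos=None):
        demos = list(demos or [])
        self.zones = [demos[:zone_size] for _ in range(n_buffers)]   # demonstration zones
        self.agent_data = [[] for _ in range(n_buffers)]             # worker transitions
        self.successes = []                                          # buffer T
        self.zone_size = zone_size

    def store_episode(self, transitions, succeeded):
        # Worker: put the whole episode into one randomly picked buffer B_i.
        self.agent_data[random.randrange(len(self.agent_data))].extend(transitions)
        if succeeded:                      # also remember successful transitions in T
            self.successes.extend(transitions)

    def sample(self, batch_size):
        # Trainer: pick one buffer B_j at random and sample from its
        # demonstration zone plus its agent transitions (uniformly here).
        j = random.randrange(len(self.agent_data))
        pool = self.zones[j] + self.agent_data[j]
        return random.sample(pool, min(batch_size, len(pool)))

    def refresh_zones(self):
        # Periodically replace each demonstration zone with random successes from T.
        if self.successes:
            n = min(self.zone_size, len(self.successes))
            for k in range(len(self.zones)):
                self.zones[k] = random.sample(self.successes, n)

# Usage: buffers = DERBuffers(demos=demo_transitions); buffers.store_episode(episode, succeeded=True)
```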

6. Neural Experience Replay Sampler (NERS)

Y. Oh et al., “Learning to Sample with Local and Global Contexts in Experience Replay Buffer”, ICLR (2021) (arXiv)

  • NERS is a neural network that learns a sampling score $\sigma _i$ for each transition in the Replay Buffer
  • The RL policy is trained as in PER, except that the inferred $\sigma _i$ is used as the priority instead of $\mid \text{TD}\mid$
  • NERS itself is trained from the replay reward (see the sketch below): $r ^{\text{re}} = \sum _{t\in \text{current episode}} r _t - \sum _{t \in \text{previous episode}}r _t$
    • At the end of each episode, take the subset of indices $I_{\text{train}}$ that were used for policy updates
    • Update NERS with $\nabla _{\phi} \mathbb{E} _{I _\text{train}} [r^{\text{re}}] = \mathbb{E} _{I _\text{train}}\left [ \sum _{i} \nabla _{\phi} \sigma _i (D(I _{\text{train}})) \right ]$
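
To make the sampling side concrete, here is a hedged PyTorch sketch in which the learned scores serve as PER-style priorities and the scorer is nudged by a REINFORCE-style loss weighted by the replay reward; the plain-MLP scorer, the feature layout, and this particular surrogate loss are illustrative assumptions (the paper's update is the equation above), and the actual local/global/score architecture is sketched after the next list.

```python
import torch
import torch.nn as nn

# Stand-in scorer: maps per-transition features to a positive sampling score.
scorer = nn.Sequential(nn.Linear(7, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

features = torch.randn(10_000, 7)          # D(I): one assumed feature row per buffer entry

# 1) Sample a training batch with probability proportional to the learned scores
#    (used as PER-style priorities instead of |TD error|).
with torch.no_grad():
    scores = scorer(features).squeeze(-1)
idx = torch.multinomial(scores / scores.sum(), num_samples=256, replacement=True)
# ... the RL agent would be updated on the transitions at `idx` here ...

# 2) At episode end, update the scorer with a REINFORCE-style surrogate weighted
#    by the replay reward (current episode return minus previous episode return).
replay_reward = 123.0 - 110.0
probs = torch.softmax(scorer(features[idx]).squeeze(-1), dim=0)
loss = -replay_reward * torch.log(probs + 1e-8).mean()
opt.zero_grad()
loss.backward()
opt.step()
```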

  • Architecture: NERS consists of 3 networks, a local network $f _l$, a global network $f _g$, and a score network $f _s$
  • Input features: $D(I) = \lbrace (s _{\kappa(i)}, a _{\kappa(i)}, r _{\kappa(i)}, s _{\kappa(i)+1}, \kappa(i), \delta _{\kappa(i)}, r _{\kappa(i)} + \gamma \max _a Q _{\hat {\theta}} (s _{\kappa(i)+1},a)) \rbrace _ {i \in I}$
    • $\kappa (i)$: time step, $\delta _{\kappa (i)}$: TD error, $Q _{\hat {\theta}}$: target network
  • The outputs of $f _g$ are averaged over the sampled transitions and concatenated to each output of $f _l$
  • $f _s$ infers the final score $\sigma _i$ from the concatenated local-global features (see the sketch below)
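
A hedged PyTorch sketch of the local/global/score wiring described above; the layer sizes and the seven assumed scalar input features are illustrative, and only the structure (global output averaged over the sampled set, concatenated to each local output, then scored by $f _s$) follows the bullets.

```python
import torch
import torch.nn as nn

def mlp(inp, out, hidden=64):
    return nn.Sequential(nn.Linear(inp, hidden), nn.ReLU(), nn.Linear(hidden, out))

class NERS(nn.Module):
    """Local network f_l, global network f_g, score network f_s (sketch)."""
    def __init__(self, feat_dim, embed_dim=32):
        super().__init__()
        self.f_local = mlp(feat_dim, embed_dim)    # per-transition features
        self.f_global = mlp(feat_dim, embed_dim)   # set-level context
        self.f_score = mlp(2 * embed_dim, 1)       # final score sigma_i >= 0

    def forward(self, d_of_i):                     # d_of_i: (batch, feat_dim)
        local = self.f_local(d_of_i)                          # (batch, embed)
        global_avg = self.f_global(d_of_i).mean(dim=0)        # averaged over the sampled set
        global_rep = global_avg.expand(local.shape[0], -1)    # broadcast to each item
        sigma = torch.nn.functional.softplus(
            self.f_score(torch.cat([local, global_rep], dim=-1))).squeeze(-1)
        return sigma                                          # sampling scores

# Toy usage: 7 assumed scalar features per transition (s, a, r, s', kappa, delta, target value).
ners = NERS(feat_dim=7)
scores = ners(torch.randn(256, 7))                 # shape (256,)
```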

End

Thank You for Reading